As per Kaggle Website:
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Throughout this analysis we will perform feature engineering on predictors one by one in order to try to form the most accurate prediction for house prices. By the end of this analysis the models of random forest, lasso, and cubist will be performed in order to make the final predictions. Of the three, cubist ultimately performed the best and was chosen as the final submission.
library(ggplot2)
library(plyr,include.only = "revalue")
library(dplyr)
library(caret)
library(gridExtra)
library(e1071) #naive bayes
library(corrplot)
library(Metrics)
library(earth)
library(knitr)
library(gt)
options(scipen=999)
There will be two data sets used in this analysis, the train data set and the test data set. The train data set contains 1460 observations and 81 variables and will be used for the data analysis and model building. The test data set will only used in the end to make the final predictions and contains 1459 rows and 80 variables, with the missing variable being the sale price which we are trying to predict.
Train Data Set:
train <- read.csv("train.csv",stringsAsFactors = FALSE)
gt(head(train))
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65 | 8450 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NA | Attchd | 2003 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2008 | WD | Normal | 208500 |
| 2 | 20 | RL | 80 | 9600 | Pave | NA | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 5 | 2007 | WD | Normal | 181500 |
| 3 | 60 | RL | 68 | 11250 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 9 | 2008 | WD | Normal | 223500 |
| 4 | 70 | RL | 60 | 9550 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NA | NA | NA | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 5 | 60 | RL | 84 | 14260 | Pave | NA | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 12 | 2008 | WD | Normal | 250000 |
| 6 | 50 | RL | 85 | 14115 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Mitchel | Norm | Norm | 1Fam | 1.5Fin | 5 | 5 | 1993 | 1995 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | Wood | Gd | TA | No | GLQ | 732 | Unf | 0 | 64 | 796 | GasA | Ex | Y | SBrkr | 796 | 566 | 0 | 1362 | 1 | 0 | 1 | 1 | 1 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1993 | Unf | 2 | 480 | TA | TA | Y | 40 | 30 | 0 | 320 | 0 | 0 | NA | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
Test Data Set:
test <- read.csv("test.csv",stringsAsFactors = FALSE)
gt(head(test))
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1461 | 20 | RH | 80 | 11622 | Pave | NA | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Feedr | Norm | 1Fam | 1Story | 5 | 6 | 1961 | 1961 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | CBlock | TA | TA | No | Rec | 468 | LwQ | 144 | 270 | 882 | GasA | TA | Y | SBrkr | 896 | 0 | 0 | 896 | 0 | 0 | 1 | 0 | 2 | 1 | TA | 5 | Typ | 0 | NA | Attchd | 1961 | Unf | 1 | 730 | TA | TA | Y | 140 | 0 | 0 | 0 | 120 | 0 | NA | MnPrv | NA | 0 | 6 | 2010 | WD | Normal |
| 1462 | 20 | RL | 81 | 14267 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | NAmes | Norm | Norm | 1Fam | 1Story | 6 | 6 | 1958 | 1958 | Hip | CompShg | Wd Sdng | Wd Sdng | BrkFace | 108 | TA | TA | CBlock | TA | TA | No | ALQ | 923 | Unf | 0 | 406 | 1329 | GasA | TA | Y | SBrkr | 1329 | 0 | 0 | 1329 | 0 | 0 | 1 | 1 | 3 | 1 | Gd | 6 | Typ | 0 | NA | Attchd | 1958 | Unf | 1 | 312 | TA | TA | Y | 393 | 36 | 0 | 0 | 0 | 0 | NA | NA | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 1463 | 60 | RL | 74 | 13830 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 5 | 5 | 1997 | 1998 | Gable | CompShg | VinylSd | VinylSd | None | 0 | TA | TA | PConc | Gd | TA | No | GLQ | 791 | Unf | 0 | 137 | 928 | GasA | Gd | Y | SBrkr | 928 | 701 | 0 | 1629 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1997 | Fin | 2 | 482 | TA | TA | Y | 212 | 34 | 0 | 0 | 0 | 0 | NA | MnPrv | NA | 0 | 3 | 2010 | WD | Normal |
| 1464 | 60 | RL | 78 | 9978 | Pave | NA | IR1 | Lvl | AllPub | Inside | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 6 | 1998 | 1998 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 20 | TA | TA | PConc | TA | TA | No | GLQ | 602 | Unf | 0 | 324 | 926 | GasA | Ex | Y | SBrkr | 926 | 678 | 0 | 1604 | 0 | 0 | 2 | 1 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Attchd | 1998 | Fin | 2 | 470 | TA | TA | Y | 360 | 36 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 6 | 2010 | WD | Normal |
| 1465 | 120 | RL | 43 | 5005 | Pave | NA | IR1 | HLS | AllPub | Inside | Gtl | StoneBr | Norm | Norm | TwnhsE | 1Story | 8 | 5 | 1992 | 1992 | Gable | CompShg | HdBoard | HdBoard | None | 0 | Gd | TA | PConc | Gd | TA | No | ALQ | 263 | Unf | 0 | 1017 | 1280 | GasA | Ex | Y | SBrkr | 1280 | 0 | 0 | 1280 | 0 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 0 | NA | Attchd | 1992 | RFn | 2 | 506 | TA | TA | Y | 0 | 82 | 0 | 0 | 144 | 0 | NA | NA | NA | 0 | 1 | 2010 | WD | Normal |
| 1466 | 60 | RL | 75 | 10000 | Pave | NA | IR1 | Lvl | AllPub | Corner | Gtl | Gilbert | Norm | Norm | 1Fam | 2Story | 6 | 5 | 1993 | 1994 | Gable | CompShg | HdBoard | HdBoard | None | 0 | TA | TA | PConc | Gd | TA | No | Unf | 0 | Unf | 0 | 763 | 763 | GasA | Gd | Y | SBrkr | 763 | 892 | 0 | 1655 | 0 | 0 | 2 | 1 | 3 | 1 | TA | 7 | Typ | 1 | TA | Attchd | 1993 | Fin | 2 | 440 | TA | TA | Y | 157 | 84 | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 4 | 2010 | WD | Normal |
There exist many variables with at least one missing value in train set.
naDF <- data.frame(sapply(train,function(x) sum(is.na(x))))
names(naDF) <- "NA_Count"
naDF$variable <- row.names(naDF)
naDF <- naDF %>%
filter(NA_Count>0) %>%
arrange(desc(NA_Count))
ggplot(naDF,aes(x=reorder(variable,NA_Count),y=NA_Count,fill=variable))+
geom_col()+
geom_text(aes(label=NA_Count), hjust=0)+
coord_flip()+
theme(legend.position = "none")+
labs(title="NA Counts of Train Data",
subtitle = "With at least 1 NA")+
ylab("NA Count")+
xlab("Train Variables")+
ylim(0,1500)
Most of the variables that contain NA values are variables where the NA value is known. For example, the Alley variable replaces the level of “none” with NA and the Basement Quality variable replaces the level of “no basement” with NA.
naDF <- naDF %>%
mutate(knownMissing=ifelse(variable %in% c("Alley","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PoolQC","Fence","MiscFeature"),"Known","Unknown"))
ggplot(naDF,aes(x=as.factor(knownMissing),fill=knownMissing))+
geom_bar()+
geom_text(stat="count",aes(label=..count..), vjust=-1)+
xlab("Known vs Unknown NA's")+
ylab("Count")+
labs(title="Count of Variables with Known vs Unknown NA values")+
guides(fill=guide_legend(title="Known vs Unknown"))+
ylim(0,15)
The fourteen variables that have known missing values are Alley, Basement Quality, Basement Condition, Basement Exposure, Basement Finish Type1, Basement Finish Type2, Fire Place Quality, Garage Type, Garage Finish, Garage Quality, Garage Condition, Pool Quality, Fence, and Miscellaneous. The appropriate values will all be added to the train data now to replace the known NA values.
train[is.na(train$Alley),"Alley"] <- "none"
for(name in c("BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2")){
train[is.na(train[,name]),name] <- "none"
}
train[is.na(train$FireplaceQu),"FireplaceQu"] <- "none"
for(name in c("GarageType","GarageFinish","GarageQual","GarageCond")){
train[is.na(train[,name]),name] <- "none"
}
train[is.na(train$PoolQC),"PoolQC"] <- "none"
train[is.na(train$Fence),"Fence"] <- "none"
train[is.na(train$MiscFeature),"MiscFeature"] <- "none"
The same will be done for the test set.
test[is.na(test$Alley),"Alley"] <- "none"
for(name in c("BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2")){
test[is.na(test[,name]),name] <- "none"
}
test[is.na(test$FireplaceQu),"FireplaceQu"] <- "none"
for(name in c("GarageType","GarageFinish","GarageQual","GarageCond")){
test[is.na(test[,name]),name] <- "none"
}
test[is.na(test$PoolQC),"PoolQC"] <- "none"
test[is.na(test$Fence),"Fence"] <- "none"
test[is.na(test$MiscFeature),"MiscFeature"] <- "none"
Only five variables have remaining NA values. These values will be calulated in the feature engineering section.
#creating na table
naDF <- data.frame(sapply(train,function(x) sum(is.na(x))))
names(naDF) <- "NA_Count"
naDF$variable <- row.names(naDF)
naDF <- naDF %>%
select(variable,NA_Count) %>%
filter(NA_Count>0)
ggplot(naDF,aes(x=variable,y=NA_Count,fill=variable))+
geom_col(aes(x=reorder(variable,desc(NA_Count))))+
geom_text(aes(label=NA_Count), vjust=-1)+
xlab("Variable")+
ylab("NA Count")+
labs(title="Remaining NA Values")+
theme(legend.position = "none")+
ylim(0,275)
There exists strong correlations with some of the predictors with correlation values above .8.
numericPredictors <- train[complete.cases(train),c(-1,-81)] %>% select_if(is.numeric) #1 = ID 81=SalesPrice
correlations <- as.data.frame(as.table(cor(numericPredictors)))
correlations <- correlations %>% filter(Freq!=1) %>% arrange(desc(Freq)) %>% rename(correlation=Freq)
gt(correlations[seq(1,nrow(correlations),by=2),] %>% head(10))
| Var1 | Var2 | correlation |
|---|---|---|
| GarageArea | GarageCars | 0.8396258 |
| X1stFlrSF | TotalBsmtSF | 0.8358336 |
| TotRmsAbvGrd | GrLivArea | 0.8244260 |
| GarageYrBlt | YearBuilt | 0.8233292 |
| GrLivArea | X2ndFlrSF | 0.6885976 |
| BsmtFullBath | BsmtFinSF1 | 0.6514960 |
| TotRmsAbvGrd | BedroomAbvGr | 0.6502724 |
| GarageYrBlt | YearRemodAdd | 0.6454552 |
| YearRemodAdd | YearBuilt | 0.6228006 |
| TotRmsAbvGrd | X2ndFlrSF | 0.6177579 |
corrplot(cor(numericPredictors))
One of the predictors from each pair of correlated predictors with a correlation value above .8 will be dropped. Since garage year built has eighty-one NA values it will be dropped from that pair. The choice of which one to drop from the remaining three pairs will be based off which predictor has a lower correlation with the dependent variable of Sales Price. The predictors that will be dropped are total rooms above ground, first floor sf, and garage area.
numericPredictors <- cbind(SalePrice=train[complete.cases(train),"SalePrice"],numericPredictors)
correlationWithSalePrice <- as.data.frame(sapply(numericPredictors,function(x) cor(x,numericPredictors$SalePrice,use="complete.obs")))
names(correlationWithSalePrice) <- "correlation"
correlationWithSalePrice$predictor <- row.names(correlationWithSalePrice)
correlationWithSalePrice %>% filter(predictor %in% c("GarageArea","GarageCars","X1stFlrSF","TotalBsmtSF","TotRmsAbvGrd","GrLivArea")) %>% arrange(desc(correlation))
## correlation predictor
## GrLivArea 0.7051392 GrLivArea
## GarageCars 0.6470931 GarageCars
## GarageArea 0.6193185 GarageArea
## TotalBsmtSF 0.6159898 TotalBsmtSF
## X1stFlrSF 0.6080921 X1stFlrSF
## TotRmsAbvGrd 0.5471478 TotRmsAbvGrd
ggplot function for continuous data:
#train with complete observations only
numericggplot <- function(variable){
histo <- ggplot(train,aes(x=train[,variable]))+
geom_histogram()+
xlab(variable)+
theme(axis.text=element_text(size=14),
axis.title=element_text(size=14,face="bold"))+
labs(title=variable)
scattero <- ggplot(train,aes(x=train[,variable],y=SalePrice))+
geom_point()+
geom_smooth(method = "lm", se=FALSE, color="black")+
xlab(variable)+
ylab("SalePrice")+
theme(axis.text=element_text(size=14),
axis.title=element_text(size=14,face="bold"))+
labs(title=variable)
grid.arrange(histo,scattero,ncol=2)
}
ggplot function for discrete data:
discreteggplot <- function(name){
discretePlot <- ggplot(train,aes(x=as.factor(train[,name])))+
geom_bar()+xlab(name)
discreteBox <- ggplot(train,aes(x=as.factor(train[,name]),y=SalePrice))+
geom_boxplot()+xlab(name)
grid.arrange(discretePlot,discreteBox,ncol=2)
}
The dependent variable we are trying to predict is SalePrice. SalePrice represents the prices of houses in Ames, Iowa. It exists in the train data set, but not in the test data set. It is skewed to the right with some very high trailing values.
salePriceHisto <- ggplot(train,aes(x=SalePrice))+
geom_histogram()+
labs(title="SalePrice")
salePriceBox <- ggplot(train,aes(x=SalePrice))+
geom_boxplot()+
labs(title="SalePrice")
grid.arrange(salePriceHisto,salePriceBox,ncol=2)
By applying the log transformation the sale price becomes normally distributed.
salePriceHistoLog <- ggplot(train,aes(x=log(SalePrice)))+
geom_histogram()+
labs(title="SalePrice")+
xlab("log(SalePrice)")
salePriceBoxLog <- ggplot(train,aes(x=log(SalePrice)))+
geom_boxplot()+
labs(title="SalePrice")+
xlab("log(SalePrice)")
grid.arrange(salePriceHistoLog,salePriceBoxLog,ncol=2)
LotFrontage: Linear feet of street connected to a property. There appears to be two outliers with lot frontage above 300 including observations 935 and 1299. Due to the high number of missing values more sophisticated methods of imputation will have to be used.
numericggplot("LotFrontage")
kable(train %>% filter(LotFrontage>300))
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 935 | 20 | RL | 313 | 27650 | Pave | none | IR2 | HLS | AllPub | Inside | Mod | NAmes | PosA | Norm | 1Fam | 1Story | 7 | 7 | 1960 | 2007 | Flat | Tar&Grv | Wd Sdng | Wd Sdng | None | 0 | TA | TA | CBlock | Gd | TA | Gd | GLQ | 425 | Unf | 0 | 160 | 585 | GasA | Ex | Y | SBrkr | 2069 | 0 | 0 | 2069 | 1 | 0 | 2 | 0 | 4 | 1 | Gd | 9 | Typ | 1 | Gd | Attchd | 1960 | RFn | 2 | 505 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | none | none | none | 0 | 11 | 2008 | WD | Normal | 242000 |
| 1299 | 60 | RL | 313 | 63887 | Pave | none | IR3 | Bnk | AllPub | Corner | Gtl | Edwards | Feedr | Norm | 1Fam | 2Story | 10 | 5 | 2008 | 2008 | Hip | ClyTile | Stucco | Stucco | Stone | 796 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 5644 | Unf | 0 | 466 | 6110 | GasA | Ex | Y | SBrkr | 4692 | 950 | 0 | 5642 | 2 | 0 | 2 | 1 | 3 | 1 | Ex | 12 | Typ | 3 | Gd | Attchd | 2008 | Fin | 2 | 1418 | TA | TA | Y | 214 | 292 | 0 | 0 | 0 | 480 | Gd | none | none | 0 | 1 | 2008 | New | Partial | 160000 |
LotArea: Lot size in square feet. Contains only four values above 100,000 including observations 250, 314, 336, and 707.
numericggplot("LotArea")
kable(train[train$LotArea>100000,])
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 250 | 50 | RL | NA | 159000 | Pave | none | IR2 | Low | AllPub | CulDSac | Sev | ClearCr | Norm | Norm | 1Fam | 1.5Fin | 6 | 7 | 1958 | 2006 | Gable | CompShg | Wd Sdng | HdBoard | BrkCmn | 472 | Gd | TA | CBlock | Gd | TA | Gd | Rec | 697 | Unf | 0 | 747 | 1444 | GasA | Gd | Y | SBrkr | 1444 | 700 | 0 | 2144 | 0 | 1 | 2 | 0 | 4 | 1 | Gd | 7 | Typ | 2 | TA | Attchd | 1958 | Fin | 2 | 389 | TA | TA | Y | 0 | 98 | 0 | 0 | 0 | 0 | none | none | Shed | 500 | 6 | 2007 | WD | Normal | 277000 |
| 314 | 314 | 20 | RL | 150 | 215245 | Pave | none | IR3 | Low | AllPub | Inside | Sev | Timber | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1965 | 1965 | Hip | CompShg | BrkFace | BrkFace | None | 0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 1236 | Rec | 820 | 80 | 2136 | GasW | TA | Y | SBrkr | 2036 | 0 | 0 | 2036 | 2 | 0 | 2 | 0 | 3 | 1 | TA | 8 | Typ | 2 | Gd | Attchd | 1965 | RFn | 2 | 513 | TA | TA | Y | 0 | 0 | 0 | 0 | 0 | 0 | none | none | none | 0 | 6 | 2009 | WD | Normal | 375000 |
| 336 | 336 | 190 | RL | NA | 164660 | Grvl | none | IR1 | HLS | AllPub | Corner | Sev | Timber | Norm | Norm | 2fmCon | 1.5Fin | 5 | 6 | 1965 | 1965 | Gable | CompShg | Plywood | Plywood | None | 0 | TA | TA | CBlock | TA | TA | Gd | ALQ | 1249 | BLQ | 147 | 103 | 1499 | GasA | Ex | Y | SBrkr | 1619 | 167 | 0 | 1786 | 2 | 0 | 2 | 0 | 3 | 1 | TA | 7 | Typ | 2 | Gd | Attchd | 1965 | Fin | 2 | 529 | TA | TA | Y | 670 | 0 | 0 | 0 | 0 | 0 | none | none | Shed | 700 | 8 | 2008 | WD | Normal | 228950 |
| 707 | 707 | 20 | RL | NA | 115149 | Pave | none | IR2 | Low | AllPub | CulDSac | Sev | ClearCr | Norm | Norm | 1Fam | 1Story | 7 | 5 | 1971 | 2002 | Gable | CompShg | Plywood | Plywood | Stone | 351 | TA | TA | CBlock | Gd | TA | Gd | GLQ | 1219 | Unf | 0 | 424 | 1643 | GasA | TA | Y | SBrkr | 1824 | 0 | 0 | 1824 | 1 | 0 | 2 | 0 | 2 | 1 | Gd | 5 | Typ | 2 | TA | Attchd | 1971 | Unf | 2 | 739 | TA | TA | Y | 380 | 48 | 0 | 0 | 0 | 0 | none | none | none | 0 | 6 | 2007 | WD | Normal | 302000 |
\(\textbf{Pick the next tab in order to see other variables.}\)
MiscVal: Value of miscellaneous features. Most have a value of zero inferring no miscellaneous feature. Given that the value is not zero, the average miscellaneous value is 1221.
numericggplot("MiscVal")
miscValTable <- as.data.frame(table(train$MiscVal))
names(miscValTable) <- c("value","frequency")
kable(miscValTable)
| value | frequency |
|---|---|
| 0 | 1408 |
| 54 | 1 |
| 350 | 1 |
| 400 | 11 |
| 450 | 4 |
| 480 | 2 |
| 500 | 8 |
| 560 | 1 |
| 600 | 4 |
| 620 | 1 |
| 700 | 5 |
| 800 | 1 |
| 1150 | 1 |
| 1200 | 2 |
| 1300 | 1 |
| 1400 | 1 |
| 2000 | 4 |
| 2500 | 1 |
| 3500 | 1 |
| 8300 | 1 |
| 15500 | 1 |
#average price of misc. value given not 0
round(mean(train[!is.na(train$MiscVal) & train$MiscVal!=0,"MiscVal"]),1)
## [1] 1221
\(\textbf{Pick the next tab in order to see other variables.}\)
YearBuilt: Original construction date. The newer the house, the higher the house prices are. However, there are a couple houses built before 1900 that are very expensize. Investigating we can see that all four of these houses had a recent remodel. A predictor will be added for houses that have a remodel year that is not equal to the year the house was built.
numericggplot("YearBuilt")
kable(train[train$SalePrice>200000 & train$YearBuilt<1900 & !is.na(train$SalePrice),c("YearBuilt","YearRemodAdd","SalePrice")])
| YearBuilt | YearRemodAdd | SalePrice | |
|---|---|---|---|
| 186 | 1892 | 1993 | 475000 |
| 305 | 1880 | 2002 | 295000 |
| 584 | 1893 | 2000 | 325000 |
| 748 | 1880 | 2003 | 265979 |
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions). Fifty-two percent of houses did not have a remodel.
numericggplot("YearRemodAdd")
print(paste(round(sum(train$YearRemodAdd==train$YearBuilt)/nrow(train)*100,0),"% of houses did not have a remodel."))
## [1] "52 % of houses did not have a remodel."
GarageYrBlt: Year garage was built. 75% of the time the garage was built when the house was built. There will be an added feature later on to depict whether or not the garage was added after the house was built. All eighty-one of the NA values in garage year built are due to the fact that the house does not have a garage.
numericggplot("GarageYrBlt")
#percentage of time garage was built when house was built
round(sum(train$GarageYrBlt==train$YearBuilt,na.rm=TRUE)/nrow(train),2)
## [1] 0.75
#comparing garage year built with house year built
ggplot(train,aes(x=GarageYrBlt-YearBuilt))+
geom_histogram()+
xlab("Difference in Time")+
labs(title="Difference Between Year House was Built and Year Garage was Built")
table(train[is.na(train$GarageYrBlt),"GarageYrBlt"],train[is.na(train$GarageYrBlt),"GarageFinish"],exclude=FALSE)
##
## none
## <NA> 81
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageCars: Size of garage in car capacity.
discreteggplot("GarageCars")
GarageArea: Size of garage in square feet.
There exists a strong correlation between garage cars and garage area which makes sense. One of the predictors will likely be dropped later on.
numericggplot("GarageArea")
ggplot(train,aes(x=as.factor(GarageCars),y=GarageArea))+
geom_boxplot()+
xlab("Number of Cars in Garage")+
labs(title="Garage Area vs Garage Cars",
subtitle=paste("Correlation:",round(cor(train$GarageCars,train$GarageArea),2)))
\(\textbf{Pick the next tab in order to see other variables.}\)
MasVnrArea: Masonry veneer area in square feet. Most of the values are zero.
numericggplot("MasVnrArea")
MasVnrType: Masonry veneer type. There are two observations that have a masonry vaneer type listed, but have a masonry veneer area of zero. There are also five cases where the Masonry Vaneer Type is none but the Masonry Vaneer Area is not zero. These will be addressed in feature engineering section by replacing these values with their median.
discreteggplot("MasVnrType")
kable(train %>% filter(MasVnrArea==0 & !MasVnrType=="None") %>% select(Id,MasVnrArea,MasVnrType))
| Id | MasVnrArea | MasVnrType |
|---|---|---|
| 689 | 0 | BrkFace |
| 1242 | 0 | Stone |
train[complete.cases(train$MasVnrType),] %>% group_by(MasVnrType) %>% summarize(medianMasVnrArea=median(MasVnrArea,na.rm=TRUE))
## # A tibble: 4 × 2
## MasVnrType medianMasVnrArea
## <chr> <dbl>
## 1 BrkCmn 192
## 2 BrkFace 202
## 3 None 0
## 4 Stone 206.
kable(train %>% filter(MasVnrArea!=0 & MasVnrType=="None") %>% select(Id,MasVnrArea,MasVnrType))
| Id | MasVnrArea | MasVnrType |
|---|---|---|
| 625 | 288 | None |
| 774 | 1 | None |
| 1231 | 1 | None |
| 1301 | 344 | None |
| 1335 | 312 | None |
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtFinSF1: Type 1 finish square feet. Observation 1299 is an outlier with the basement type1 square feet being above 3000.
numericggplot("BsmtFinSF1")
kable(train %>% filter(BsmtFinSF1>3000) %>% select(Id,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF))
| Id | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF |
|---|---|---|---|---|
| 1299 | 5644 | 0 | 466 | 6110 |
BsmtFinSF2: Type 2 finished square feet.
numericggplot("BsmtFinSF2")
BsmtUnfSF: Unfinished square feet of basement area.
numericggplot("BsmtUnfSF")
TotalBsmtSF: Total square feet of basement area. Is the same as the sum of finished square feet 1, finished square feet 2, and unfinished square feet. There is one outlier with a total basement square feet above 6000 which is observation 1299. However, the sum of finished square feet 1, finished square feet 2, and unfinished square feet is equal to the total sum of square feet basement area for each observation meaning no typo errors.
numericggplot("TotalBsmtSF")
sum(train$TotalBsmtSF!=train$BsmtFinSF1+train$BsmtFinSF2+train$BsmtUnfSF)
## [1] 0
\(\textbf{Pick the next tab in order to see other variables.}\)
X1stFlrSF: First Floor square feet.
numericggplot("X1stFlrSF")
X2ndFlrSF: Second floor square feet.
numericggplot("X2ndFlrSF")
LowQualFinSF: Low quality finished square feet (all floors). Only twenty-six houses in the train data set have low quality finished square feet.
numericggplot("LowQualFinSF")
train %>% filter(LowQualFinSF>0) %>% count()
## n
## 1 26
GrLivArea: Above grade (ground) living area square feet.
First floor square feet + second floor square feet + low quality square feet = above ground living area for each observation. Observation 1299 comes up again as an outlier, will likely be dropped in feature engineering stage.
numericggplot("GrLivArea")
sum(train$X1stFlrSF+train$X2ndFlrSF+train$LowQualFinSF!=train$GrLivArea)/nrow(train)
## [1] 0
kable(train[train$GrLivArea>5000,c("Id","SalePrice","GrLivArea","LowQualFinSF","X2ndFlrSF","X1stFlrSF")])
| Id | SalePrice | GrLivArea | LowQualFinSF | X2ndFlrSF | X1stFlrSF | |
|---|---|---|---|---|---|---|
| 1299 | 1299 | 160000 | 5642 | 0 | 950 | 4692 |
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtFullBath: Basement full bathrooms. Only one house has three full bathrooms in the basement. Although it appears unlikely for a basement of that size to have three full bathrooms, it is not entirely impossible. We can analyze this from the box-plot where a basement of a similar size has two full basement bathrooms.
discreteggplot("BsmtFullBath")
ggplot(train,aes(x=as.factor(BsmtFullBath),y=TotalBsmtSF))+
geom_boxplot()+
xlab("Number of Full Basement Bathrooms")+
labs(title="Total Basement sq.ft vs Number of Basement Full Bathrooms")
BsmtHalfBath: Basement half bathrooms.
discreteggplot("BsmtHalfBath")
FullBath: Full bathrooms above grade.
discreteggplot("FullBath")
HalfBath: Half baths above grade.
discreteggplot("HalfBath")
There is no variable for the total bathrooms in a house. This will be added later on in the feature engineering section.
train %>%
mutate(totalBathrooms=BsmtFullBath+BsmtHalfBath+HalfBath+FullBath,na.rm=TRUE) %>%
ggplot(.,aes(x=as.factor(totalBathrooms)))+
geom_bar()+
xlab("Total Bathooms")+
ylab("Count of Bathrooms")+
labs(title="Total Bathrooms")
\(\textbf{Pick the next tab in order to see other variables.}\)
BedroomAbvGr: Bedrooms above ground (does NOT include basement bedrooms).
discreteggplot("BedroomAbvGr")
KitchenAbvGr: Kitchens above ground.
discreteggplot("KitchenAbvGr")
TotRmsAbvGrd: Total rooms above ground (does not include bathrooms).
discreteggplot("TotRmsAbvGrd")
There will be a feature added later on for rooms that are neither kitchen or bedrooms.
train %>% mutate(otherRooms=TotRmsAbvGrd-BedroomAbvGr-KitchenAbvGr) %>%
ggplot(.,aes(x=otherRooms))+
geom_bar()+xlab("Other Rooms")
\(\textbf{Pick the next tab in order to see other variables.}\)
Fireplaces: Number of fireplaces. Nearly all houses have either one or two fireplaces.
discreteggplot("Fireplaces")
\(\textbf{Pick the next tab in order to see other variables.}\)
OpenPorchSF: Open porch area in square feet.
numericggplot("OpenPorchSF")
EnclosedPorch: Enclosed porch area in square feet.
numericggplot("EnclosedPorch")
X3SsnPorch: Three season porch area in square feet.
numericggplot("X3SsnPorch")
ScreenPorch: Screen porch area in square feet.
numericggplot("ScreenPorch")
A predictor will be made later for whether or not a house has a porch. Most houses have a porch.
train %>% mutate(hasPorch=ifelse(ScreenPorch>0 | X3SsnPorch>0 | EnclosedPorch>0 | OpenPorchSF>0,1,0)) %>%
ggplot(.,aes(x=as.factor(hasPorch)))+
geom_bar()+
xlab("Has a Porch")
Could also add a predictor for the total number of porch square feet.
train %>% mutate(totalPorchArea=ScreenPorch+X3SsnPorch+EnclosedPorch+OpenPorchSF) %>%
ggplot(.,aes(x=totalPorchArea))+
geom_histogram()
WoodDeckSF: Wood deck area in square feet.
numericggplot("WoodDeckSF")
\(\textbf{Pick the next tab in order to see other variables.}\)
PoolArea: Pool area in square feet.
numericggplot("PoolArea")
PoolQC: Quality of the pool.
discreteggplot("PoolQC")
MoSold: Month Sold (MM).
Most houses seem to sell in the summer months, but the average house price per month sold seems pretty even.
moSold1 <- ggplot(train,aes(x=as.character(MoSold)))+
geom_bar()+
scale_x_discrete(limits=seq(1,12))+
xlab("Months")
moSold2 <- ggplot(train[!is.na(train$SalePrice),],aes(x=as.character(MoSold),y=SalePrice))+
geom_boxplot()+
scale_x_discrete(limits=seq(1,12))+
xlab("Months")
grid.arrange(moSold1,moSold2,ncol=2)
YrSold: Year Sold (YYYY).
From the previous charts we can see the end of the range of observations is in August 2010, which explains why 2010 has the least number of houses sold. There appears to be a gradual decrease in the median price of a house sold from 2006 to 2010. Both month sold and year sold will be converted to character predictors as there is no apparent ordering of sale price within them.
discreteggplot("YrSold")
OverallQual: Rates the overall material and finish of the house
There is a clear increase in the sale prices as the quality of the house increases.
overallqualBar <- ggplot(train,aes(x=OverallQual))+
geom_bar()+
scale_x_discrete(limits=seq(1,10))+
xlab("Overall Quality")
overallqualBox <- ggplot(train,aes(x=as.factor(OverallQual),y=SalePrice))+
geom_boxplot()+
scale_x_discrete(limits=seq(1,10))+
xlab("Overall Quality")
grid.arrange(overallqualBar,overallqualBox,ncol=2)
\(\textbf{Pick the next tab in order to see other variables.}\)
OverallCond: Rates the overall condition of the house.
There is one outlier for the sale price when the overall condition is 2 and another outlier when the overall condition is 6. These observations are 379 and 692.
ggplot(train,aes(x=as.character(OverallCond),y=SalePrice))+
geom_boxplot()+
scale_x_discrete(limits=seq(1,10))+
xlab("Overall Condition")
kable(train[train$OverallCond==2 & train$SalePrice>350000 & !is.na(train$SalePrice),])
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 379 | 379 | 20 | RL | 88 | 11394 | Pave | none | Reg | Lvl | AllPub | Corner | Gtl | StoneBr | Norm | Norm | 1Fam | 1Story | 9 | 2 | 2010 | 2010 | Hip | CompShg | VinylSd | VinylSd | Stone | 350 | Gd | TA | PConc | Ex | TA | Av | GLQ | 1445 | Unf | 0 | 411 | 1856 | GasA | Ex | Y | SBrkr | 1856 | 0 | 0 | 1856 | 1 | 0 | 1 | 1 | 1 | 1 | Ex | 8 | Typ | 1 | Ex | Attchd | 2010 | Fin | 3 | 834 | TA | TA | Y | 113 | 0 | 0 | 0 | 0 | 0 | none | none | none | 0 | 6 | 2010 | New | Partial | 394432 |
kable(train[train$OverallCond==6 & train$SalePrice>600000 & !is.na(train$SalePrice),])
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | X1stFlrSF | X2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | X3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 692 | 692 | 60 | RL | 104 | 21535 | Pave | none | IR1 | Lvl | AllPub | Corner | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 10 | 6 | 1994 | 1995 | Gable | WdShngl | HdBoard | HdBoard | BrkFace | 1170 | Ex | TA | PConc | Ex | TA | Gd | GLQ | 1455 | Unf | 0 | 989 | 2444 | GasA | Ex | Y | SBrkr | 2444 | 1872 | 0 | 4316 | 0 | 1 | 3 | 1 | 4 | 1 | Ex | 10 | Typ | 2 | Ex | Attchd | 1994 | Fin | 3 | 832 | TA | TA | Y | 382 | 50 | 0 | 0 | 0 | 0 | none | none | none | 0 | 1 | 2007 | WD | Normal | 755000 |
\(\textbf{Pick the next tab in order to see other variables.}\)
ExterQual: Evaluates the quality of the material on the exterior.
discreteggplot("ExterQual")
ExterCond: Evaluates the present condition of the material on the exterior. External quality and external condition will both be made into ordered predictors.
discreteggplot("ExterCond")
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtQual: Evaluates the height of the basement.
discreteggplot("BsmtQual")
BsmtCond: Evaluates the general condition of the basement. Both basement quality and basement condition will be made into ordered predictors.
discreteggplot("BsmtCond")
\(\textbf{Pick the next tab in order to see other variables.}\)
KitchenQual: Kitchen quality. Will be made into ordered predictor.
discreteggplot("KitchenQual")
\(\textbf{Pick the next tab in order to see other variables.}\)
FireplaceQu: Fireplace quality.
discreteggplot("FireplaceQu")
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageQual: Garage quality.
discreteggplot("GarageQual")
GarageCond: Garage condition. Both garage quality and condition and will be made into ordered predictors.
discreteggplot("GarageCond")
\(\textbf{Pick the next tab in order to see other variables.}\)
PoolQC: Pool quality.
discreteggplot("PoolQC")
\(\textbf{Pick the next tab in order to see other variables.}\)
Street: Type of road access to property.
Can be made into ordered predictor with Paved>Gravel.
discreteggplot("Street")
\(\textbf{Pick the next tab in order to see other variables.}\)
LandSlope: Slope of property.
Made into ordered with Gtl>Mod>Sev.
discreteggplot("LandSlope")
\(\textbf{Pick the next tab in order to see other variables.}\)
HeatingQC: Heating quality and condition.
Heating quality, central air, and electrical will all be changed into ordered.
discreteggplot("HeatingQC")
CentralAir: Central air conditioning.
discreteggplot("CentralAir")
Electrical: Electrical system. Since Electrical is dominated by level of Circuit breaker the NA value will be replaced with this value in the feature engineering section.
discreteggplot("Electrical")
\(\textbf{Pick the next tab in order to see other variables.}\)
MasVnrType: Masonry veneer type.
Will be made into an ordered predictor with the sequential order of None<Brick Common<Brick Face<Stone. Cinder block is not in the data set so it will not be included, not sure where it would rank.
discreteggplot("MasVnrType")
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtExposure: Refers to walkout or garden level walls.
Can be made into ordered based on level of exposure.
discreteggplot("BsmtExposure")
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageFinish: Interior finish of the garage.
Garage finish will be made into ordered.
discreteggplot("GarageFinish")
\(\textbf{Pick the next tab in order to see other variables.}\)
PavedDrive: Paved driveway.
discreteggplot("PavedDrive")
\(\textbf{Pick the next tab in order to see other variables.}\)
Fence: Fence quality.
It appears having no fence translates to a higher sales price than having a fence. Will convert this predictor to ordered with having no fence having the highest median sales price.
discreteggplot("Fence")
MSSubClass: Identifies the type of dwelling involved in the sale.
The type of dwelling is currently a numeric predictor, it should be a factor.
discreteggplot("MSSubClass")
\(\textbf{Pick the next tab in order to see other variables.}\)
MSZoning: Identifies the general zoning classification of the sale.
discreteggplot("MSZoning")
\(\textbf{Pick the next tab in order to see other variables.}\)
Alley: Type of alley access to property.
discreteggplot("Alley")
\(\textbf{Pick the next tab in order to see other variables.}\)
LotShape: General shape of property.
Could be an ordered predictor, box-plot doesn’t support this claim though so leaving as character.
discreteggplot("LotShape")
\(\textbf{Pick the next tab in order to see other variables.}\)
LandContour: Flatness of the property.
discreteggplot("LandContour")
\(\textbf{Pick the next tab in order to see other variables.}\)
LotConfig: Lot configuration.
discreteggplot("LotConfig")
\(\textbf{Pick the next tab in order to see other variables.}\)
Neighborhood: Physical locations within Ames city limits.
There exists twenty-five possible labels for neighborhood.
discreteggplot("Neighborhood")
train %>% group_by(Neighborhood) %>% summarize(count=n(),avgSalePrice=mean(SalePrice)) %>% arrange(desc(count))
## # A tibble: 25 × 3
## Neighborhood count avgSalePrice
## <chr> <int> <dbl>
## 1 NAmes 225 145847.
## 2 CollgCr 150 197966.
## 3 OldTown 113 128225.
## 4 Edwards 100 128220.
## 5 Somerst 86 225380.
## 6 Gilbert 79 192855.
## 7 NridgHt 77 316271.
## 8 Sawyer 74 136793.
## 9 NWAmes 73 189050.
## 10 SawyerW 59 186556.
## # … with 15 more rows
\(\textbf{Pick the next tab in order to see other variables.}\)
Condition1: Proximity to various conditions.
discreteggplot("Condition1")
table(train$Condition1)
##
## Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
## 48 81 1260 8 19 11 26 2 5
Condition2: Proximity to various conditions (if more than one is present).
87% of the houses have the same condition1 and condition2 value, one of the columns will likely be dropped.
discreteggplot("Condition2")
round(sum((train$Condition1==train$Condition2)/nrow(train)),2)
## [1] 0.87
\(\textbf{Pick the next tab in order to see other variables.}\)
BldgType: Type of dwelling.
discreteggplot("BldgType")
HouseStyle: Style of dwelling.
House Style has some rare labels that may require grouping.
discreteggplot("HouseStyle")
\(\textbf{Pick the next tab in order to see other variables.}\)
RoofStyle: Type of roof.
discreteggplot("RoofStyle")
RoofMatl: Roof material.
Roof material has four levels that only occur once. These levels will either need to be dropped or grouped.
discreteggplot("RoofMatl")
table(train$RoofMatl)
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
\(\textbf{Pick the next tab in order to see other variables.}\)
Exterior1st: Exterior covering on house.
exterior1stBar <- ggplot(train,aes(x=Exterior1st))+
geom_bar()+
theme(axis.text.x = element_text(angle = 45))
exterior1stBox <- ggplot(train[!is.na(train$SalePrice),],aes(x=Exterior1st,y=SalePrice))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45))
grid.arrange(exterior1stBar,exterior1stBox, ncol=2)
Exterior2nd: Exterior covering on house (if more than one material).
Most of the time exterior1st is the same as exterior2nd, one of the predictors will likely be dropped.
exterior2ndBar <- ggplot(train,aes(x=Exterior2nd))+
geom_bar()+
theme(axis.text.x = element_text(angle = 45))
exterior2ndBox <- ggplot(train[!is.na(train$SalePrice),],aes(x=Exterior2nd,y=SalePrice))+
geom_boxplot()+
theme(axis.text.x = element_text(angle = 45))
grid.arrange(exterior2ndBar,exterior2ndBox, ncol=2)
round(sum(train$Exterior1st==train$Exterior2nd)/nrow(train),2)
## [1] 0.85
\(\textbf{Pick the next tab in order to see other variables.}\)
Foundation: Type of foundation.
discreteggplot("Foundation")
\(\textbf{Pick the next tab in order to see other variables.}\)
BsmtFinType1: Rating of basement finished area.
discreteggplot("BsmtFinType1")
BsmtFinType2: Rating of basement finished area (if multiple types).
From the plots it is shown that the sale price did not increase as would be expected when moving from no basement to good living quarters. Because of this, these two predictors will remain character instead of being made into ordered. Almost all of basement finish type2 are of the unfinished class.
discreteggplot("BsmtFinType2")
\(\textbf{Pick the next tab in order to see other variables.}\)
Heating: Type of heating.
Nearly all houses have gas forced heating.
discreteggplot("Heating")
table(train$Heating)
##
## Floor GasA GasW Grav OthW Wall
## 1 1428 18 7 2 4
\(\textbf{Pick the next tab in order to see other variables.}\)
Functional: Home functionality (Assume typical unless deductions are warranted).
Nearly all values are typical.
discreteggplot("Functional")
\(\textbf{Pick the next tab in order to see other variables.}\)
GarageType: Garage location.
discreteggplot("GarageType")
\(\textbf{Pick the next tab in order to see other variables.}\)
MiscFeature: Miscellaneous feature not covered in other categories.
96% of houses do not have a miscellaneous feature.
discreteggplot("MiscFeature")
round(sum(train$MiscFeature=="none")/nrow(train),2)
## [1] 0.96
\(\textbf{Pick the next tab in order to see other variables.}\)
SaleType: Type of sale.
discreteggplot("SaleType")
SaleCondition: Condition of sale.
Both are dominated by one label and have a number of rare labels.
discreteggplot("SaleCondition")
MasVnrArea and MasVnrType:
In kaggle competitions we have access to the test set so this allows us to take a peak for any possible observations with data entry errors or weird values. The masonry veneer area and type exhibit the same problems that occurred in the train set.
test %>% filter(MasVnrArea!=0 & MasVnrType=="None") %>% select(Id,MasVnrArea,MasVnrType)
## Id MasVnrArea MasVnrType
## 1 1670 285 None
## 2 2453 1 None
test %>% filter(MasVnrArea==0 & MasVnrType!="None") %>% select(Id,MasVnrArea,MasVnrType)
## Id MasVnrArea MasVnrType
## 1 2320 0 BrkFace
Now that the data analysis portion is over a random forest model will be run to develop a baseline score in which we hope to beat.
Train Control:
custom_summary = function(data, lev = NULL, model = NULL) {
library(Metrics)
out = rmsle(data[, "obs"], data[, "pred"])
names(out) = c("rmsle")
out
}
ctrl <- trainControl(method="cv",
number=10,
summaryFunction = custom_summary,
allowParallel = TRUE)
Removing near zero variances.
noVariance <- nearZeroVar(train)
noNearZeroVar <- train[,-noVariance]
Random Forest Model:
set.seed(123)
rfBase <- train(SalePrice~.,data=noNearZeroVar[complete.cases(train),-1],
trControl=ctrl,
metric="rmsle",
maximize=FALSE,
tuneGrid=expand.grid(mtry=seq(20,40,by=5)))
saveRDS(rfBase,"rfBase.rds")
rfBase <- readRDS("rfBase.rds")
min(rfBase$results$rmsle)
## [1] 0.1394914
Variable Importance:
From the baseline random forest model we are also able to gather the variable importance in predicting sales price. The top 10 listed important variables are overall quality, above ground living area, garage cars, external quality, total basement square feet, kitchen quality, 1st floor square feet, garage area, year built, and basement quality.
varImp(rfBase)
## rf variable importance
##
## only 20 most important variables shown (out of 193)
##
## Overall
## OverallQual 100.000
## GrLivArea 50.372
## GarageCars 36.472
## ExterQualTA 32.607
## TotalBsmtSF 32.597
## YearBuilt 30.442
## X1stFlrSF 24.929
## GarageArea 22.189
## X2ndFlrSF 16.169
## BsmtFinSF1 15.273
## FullBath 11.267
## LotArea 10.280
## GarageYrBlt 9.800
## TotRmsAbvGrd 9.520
## YearRemodAdd 7.377
## MasVnrArea 6.168
## LotFrontage 5.326
## FoundationPConc 5.229
## KitchenQualTA 4.664
## ExterQualGd 4.451
Takeaways from the data analysis section:
Corrections to MasVnrArea and MasVnrType are needed.
Imputation is still required for missing values.
There are some predictors we can add to our model.
There are near zero variance predictors in the data set, meaning they supply little information.
Highly correlated predictors exist in the data set.
There are many predictors that can be represented as ordered predictors.
Some variables have too many levels which can lead to over fitting.
There was one outlier value that kept appearing.
Mode function:
mode <- function(x){
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
MasVnrArea and MasVnrType:
Masonry veneer area and type have errors such as an area of zero when a type is listed or an area above zero when listed as none. These issues will fixed by taking the median values for the masonry veneer area and the second most common masonry veneer type which is brick face for masonry veneer type.
train[train$Id==689,"MasVnrArea"] <- median(train[train$MasVnrType=="BrkFace" & !is.na(train$MasVnrArea),"MasVnrArea"])
train[train$Id==1242,"MasVnrArea"] <- median(train[train$MasVnrType=="Stone" & !is.na(train$MasVnrArea),"MasVnrArea"])
train[train$Id%in%c(774,1231),"MasVnrArea"] <- 0
train[train$Id %in% c(625,1301,1335) & train$MasVnrType=="None","MasVnrType"] <- "BrkFace"
Applying the same to the test set.
test[test$Id==2453,"MasVnrArea"] <- 0
test[test$Id==1670,"MasVnrType"] <- "BrkFace"
test[test$Id==2320,"MasVnrArea"] <- median(train[train$MasVnrType=="BrkFace" & !is.na(train$MasVnrArea),"MasVnrArea"])
Imputation:
For Electrical and MasVnrType mode imputation will be used because there are only a total on nine NA values between the two. Any changes applied to the train data set should also be applied to the test.
test[is.na(test$Electrical),"Electrical"] <- mode(train[!is.na(train$Electrical),"Electrical"])
test[is.na(test$MasVnrType),"MasVnrType"] <- mode(train[!is.na(train$MasVnrType),"MasVnrType"])
train[is.na(train$Electrical),"Electrical"] <- mode(train[!is.na(train$Electrical),"Electrical"])
train[is.na(train$MasVnrType),"MasVnrType"] <- mode(train[!is.na(train$MasVnrType),"MasVnrType"])
For MasVnrArea there are only eight NA values, median imputation will be used.
test[is.na(test$MasVnrArea),"MasVnrArea"] <-
median(train[!is.na(train$MasVnrArea),"MasVnrArea"])
train[is.na(train$MasVnrArea),"MasVnrArea"] <- median(train[!is.na(train$MasVnrArea),"MasVnrArea"])
KNN imputation will be used for lot frontage. Since KNN imputation center and scales the data the reverse of the operations will be applied to change lot frontage back to its original form.
library(RANN)
lotFrontageMean <- mean(train[!is.na(train$LotFrontage),"LotFrontage"])
sdLotFrontage <- sd(train[!is.na(train$LotFrontage),"LotFrontage"])
imputeMissing <- preProcess(train[,c(-1,-81)],"knnImpute") # 1=ID, 81=SalePrice
imputedMissingTrain <- predict(imputeMissing,train)
# unstandardizing the data
train$LotFrontage <- (imputedMissingTrain$LotFrontage*sdLotFrontage)+lotFrontageMean
Same imputation method will be used for the test set.
imputedMissingTest <- predict(imputeMissing,test)
# unstandardizing the data
test$LotFrontage <- (imputedMissingTest$LotFrontage*sdLotFrontage)+lotFrontageMean
Garage year built has intrinsic missing values which represent not having a garage. Since it is highly correlated with the year the house was built, imputation will not be needed as it will be dropped later on.
Test NAs:
There are remaining NA values in the test set. Looking at the basement and garage NA’s it appears those observations did not have a basement or garage so the NA values are replaced with zeros. For the remaining NA values mode imputation is used. Again we are skipping the year the garage was built because that variable will be dropped.
testNAs <- as.data.frame(sapply(test,function(x) sum(is.na(x))))
names(testNAs) <- "NA_Count"
testNAs$variable <- row.names(testNAs)
testNAs %>% arrange(desc(NA_Count)) %>% filter(NA_Count>0)
## NA_Count variable
## GarageYrBlt 78 GarageYrBlt
## MSZoning 4 MSZoning
## Utilities 2 Utilities
## BsmtFullBath 2 BsmtFullBath
## BsmtHalfBath 2 BsmtHalfBath
## Functional 2 Functional
## Exterior1st 1 Exterior1st
## Exterior2nd 1 Exterior2nd
## BsmtFinSF1 1 BsmtFinSF1
## BsmtFinSF2 1 BsmtFinSF2
## BsmtUnfSF 1 BsmtUnfSF
## TotalBsmtSF 1 TotalBsmtSF
## KitchenQual 1 KitchenQual
## GarageCars 1 GarageCars
## GarageArea 1 GarageArea
## SaleType 1 SaleType
test[,grepl("Bsmt",names(test))] %>% filter(is.na(BsmtFullBath))
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 none none none none NA none
## 2 none none none none 0 none
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath BsmtHalfBath
## 1 NA NA NA NA NA
## 2 0 0 0 NA NA
test[test$Id==2121,c("BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","BsmtFullBath","BsmtHalfBath")] <- 0
test[test$Id==2189,c("BsmtFullBath","BsmtHalfBath")] <- 0
test[,grepl("Garage",names(test))] %>% filter(is.na(GarageArea))
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Detchd NA none NA NA none
## GarageCond
## 1 none
test[test$Id==2577,c("GarageCars","GarageArea")] <- 0
test[is.na(test$MSZoning),"MSZoning"] <- mode(train$MSZoning)
test[is.na(test$Utilities),"Utilities"] <- mode(train$Utilities)
test[is.na(test$Functional),"Functional"] <- mode(train$Functional)
test[is.na(test$Exterior1st),"Exterior1st"] <- mode(train$Exterior1st)
test[is.na(test$Exterior2nd),"Exterior2nd"] <- mode(train$Exterior2nd)
test[is.na(test$KitchenQual),"KitchenQual"] <- mode(train$KitchenQual)
test[is.na(test$SaleType),"SaleType"] <- mode(train$SaleType)
Adding Predictors:
The predictors added to our data will be whether or not a house had a garage built after the house was built, total rooms that are neither bathrooms or kitchens, total number of bathrooms, whether a house has a porch, total porch sf, and whether or not a house had a remodel.
train <- train %>%
mutate(newGarage=ifelse(!is.na(GarageYrBlt) & YearBuilt!=GarageYrBlt ,1,0),
otherRooms=TotRmsAbvGrd-FullBath-HalfBath,
totalBathRoom=FullBath+HalfBath+BsmtFullBath+BsmtHalfBath,
hasPorch=ifelse(WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch>0,1,0),
totalPorchSF=WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch,
hasRemodel=ifelse(YearBuilt!=YearRemodAdd,1,0))
test <- test %>%
mutate(newGarage=ifelse(!is.na(GarageYrBlt) & YearBuilt!=GarageYrBlt ,1,0),
otherRooms=TotRmsAbvGrd-FullBath-HalfBath,
totalBathRoom=FullBath+HalfBath+BsmtFullBath+BsmtHalfBath,
hasPorch=ifelse(WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch>0,1,0),
totalPorchSF=WoodDeckSF+OpenPorchSF+EnclosedPorch+X3SsnPorch,
hasRemodel=ifelse(YearBuilt!=YearRemodAdd,1,0))
When looking at the year the house was built compared to the sales price it appears before a certain period the age of the house does not matter. This point is found using a mars model. Houses built before 1971 have less affect for a one year increase in the age of the house on sale price compared to a house built after 1971. A newly created variable of ‘OldHouse’ will be used to address this.
# Building mars model to find hinge point
yearBuiltBinned <- earth(SalePrice~YearBuilt,train[,-1],nprune = 2)
#building data frame of predicted values from the mars model
predictedValues <- data.frame(YearBuilt=train["YearBuilt"],predict(yearBuiltBinned,train))
#plotting what it looks like with the hinge function
predictedValues %>%
ggplot(.,aes(YearBuilt,SalePrice))+
geom_smooth(color="red",size=3)+
geom_point(data=train,aes(x=YearBuilt,y=SalePrice))+
geom_point()+
annotate("segment",x=1971,xend=1971,y=0,yend=350000,color="red")+
labs(title="Hinge at 1971")
#adding the binary variable
train$OldHouse <- ifelse(train$YearBuilt<=1971,1,0)
test$OldHouse <- ifelse(test$YearBuilt<=1971,1,0)
Near-Zero Variance Predictors:
Near-zero variance predictors enter extra variance into our model while supplying little information. The predictors in this first iteration of feature engineering that will be dropped will be chosen very conservatively. Only the predictors of Street, Utilities, Heating, and PoolQC will be dropped.
noVariance <- nearZeroVar(train,freqCut = 40,uniqueCut = .5,saveMetrics = TRUE)
noVariance %>% filter(nzv==TRUE)
## freqRatio percentUnique zeroVar nzv
## Street 242.33333 0.1369863 FALSE TRUE
## Utilities 1459.00000 0.1369863 FALSE TRUE
## Heating 79.33333 0.4109589 FALSE TRUE
## PoolQC 484.33333 0.2739726 FALSE TRUE
train <- train %>% select(-Street,-Utilities,-Heating,-PoolQC)
test <- test %>% select(-Street,-Utilities,-Heating,-PoolQC)
Correlated Predictors:
The highly correlated numeric predictors we found that will be dropped are total rooms above ground, first floor sf, garage area, the year the garage was built, and WoodDeckSF.
numericPreds <- train[,c(-1,-77)] %>% select_if(is.numeric)
correlations <- as.data.frame(as.table(cor(numericPreds,use="complete.obs"))) %>% filter(Freq!=1) %>% arrange(desc(Freq))
correlations <- correlations[seq(1,nrow(correlations),by=2),]
head(correlations)
## Var1 Var2 Freq
## 1 otherRooms TotRmsAbvGrd 0.8700782
## 3 GarageArea GarageCars 0.8314807
## 5 GarageYrBlt YearBuilt 0.8256675
## 7 X1stFlrSF TotalBsmtSF 0.8224694
## 9 TotRmsAbvGrd GrLivArea 0.8209748
## 11 totalPorchSF WoodDeckSF 0.8099937
train <- train %>% select(-TotRmsAbvGrd,-X1stFlrSF,-GarageArea,-GarageYrBlt,-WoodDeckSF)
test <- test %>% select(-TotRmsAbvGrd,-X1stFlrSF,-GarageArea,-GarageYrBlt,-WoodDeckSF)
From the data analysis there are some categorical predictors that have nearly all the same labels. These variables will be dropped now.
train <- train %>% select(-Condition2,-Exterior2nd,-BsmtFinType2)
test <- test %>% select(-Condition2,-Exterior2nd,-BsmtFinType2)
Ordered Predictors:
Many of the predictors can be represented as ordered predictors. This will lower the amount of factor levels which will improve computation time and can lower the chance of over fitting.
noNone <- c('Po' = 0, 'Fa' = 1, 'TA' = 2, 'Gd' = 3, 'Ex' = 4)
withNone <-c('none' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
train$ExterQual<-as.integer(revalue(train$ExterQual, noNone))
train$ExterCond<-as.integer(revalue(train$ExterCond, noNone))
train$KitchenQual <- as.integer(revalue(train$KitchenQual,noNone))
train$HeatingQC <- as.integer(revalue(train$HeatingQC,noNone))
train$BsmtQual <- as.integer(revalue(train$BsmtQual,withNone))
train$BsmtCond <- as.integer(revalue(train$BsmtCond,withNone))
train$FireplaceQu <- as.integer(revalue(train$FireplaceQu,withNone))
train$GarageQual <- as.integer(revalue(train$GarageQual,withNone))
train$GarageCond <- as.integer(revalue(train$GarageCond,withNone))
test$ExterQual<-as.integer(revalue(test$ExterQual, noNone))
test$ExterCond<-as.integer(revalue(test$ExterCond, noNone))
test$KitchenQual <- as.integer(revalue(test$KitchenQual,noNone))
test$HeatingQC <- as.integer(revalue(test$HeatingQC,noNone))
test$BsmtQual <- as.integer(revalue(test$BsmtQual,withNone))
test$BsmtCond <- as.integer(revalue(test$BsmtCond,withNone))
test$FireplaceQu <- as.integer(revalue(test$FireplaceQu,withNone))
test$GarageQual <- as.integer(revalue(test$GarageQual,withNone))
test$GarageCond <- as.integer(revalue(test$GarageCond,withNone))
train$LandSlope <- as.integer(revalue(train$LandSlope,c("Gtl"=0,"Mod"=1,"Sev"=2)))
train$CentralAir <- as.integer(revalue(train$CentralAir,c('N'=0,'Y'=1)))
train$Electrical <- as.integer(revalue(train$Electrical,c('Mix'=0,'FuseP'=1,'FuseF'=2,'FuseA'=3,'SBrkr'=4)))
train$MasVnrType <- as.integer(revalue(train$MasVnrType,c('None'=0,'BrkCmn'=1,'BrkFace'=2,'Stone'=3)))
train$BsmtExposure <- as.integer(revalue(train$BsmtExposure,c('none'=0,'No'=1,'Mn'=2,'Av'=3,'Gd'=4)))
train$GarageFinish <- as.integer(revalue(train$GarageFinish,c('none'=0,'Unf'=1,'RFn'=2,'Fin'=3)))
train$PavedDrive <- as.integer(revalue(train$PavedDrive,c('N'=0,'P'=1,'Y'=2)))
train$Fence <- as.integer(revalue(train$Fence,c('MnWw'=0,'GdWo'=1,'MnPrv'=2,'GdPrv'=3,'none'=4)))
test$LandSlope <- as.integer(revalue(test$LandSlope,c("Gtl"=0,"Mod"=1,"Sev"=2)))
test$CentralAir <- as.integer(revalue(test$CentralAir,c('N'=0,'Y'=1)))
test$Electrical <- as.integer(revalue(test$Electrical,c('Mix'=0,'FuseP'=1,'FuseF'=2,'FuseA'=3,'SBrkr'=4)))
test$MasVnrType <- as.integer(revalue(test$MasVnrType,c('None'=0,'BrkCmn'=1,'BrkFace'=2,'Stone'=3)))
test$BsmtExposure <- as.integer(revalue(test$BsmtExposure,c('none'=0,'No'=1,'Mn'=2,'Av'=3,'Gd'=4)))
test$GarageFinish <- as.integer(revalue(test$GarageFinish,c('none'=0,'Unf'=1,'RFn'=2,'Fin'=3)))
test$PavedDrive <- as.integer(revalue(test$PavedDrive,c('N'=0,'P'=1,'Y'=2)))
test$Fence <- as.integer(revalue(test$Fence,c('MnWw'=0,'GdWo'=1,'MnPrv'=2,'GdPrv'=3,'none'=4)))
Grouping Rare Levels:
Any levels with less than fifteen occurrences in the train set will be grouped. Most of these will be done by replacing the value with the mode. In other cases the rare levels will be grouped with a level that is similar.
train$MSSubClass <- as.character(train$MSSubClass)
test$MSSubClass <- as.character(test$MSSubClass)
train[train$MSSubClass %in% c("40","80"),"MSSubClass"] <- mode(train$MSSubClass)
train[train$MSZoning %in% c("C (all)"),"MSZoning"] <- mode(train$MSZoning)
train[train$LotShape %in% c("IR1","IR2","IR3"),"LotShape"] <- "irregular"
train[train$LotConfig %in% c("FR2","FR3"),"LotConfig"] <- "twoPlus"
train[train$Neighborhood %in% c("Blueste","NPkVill","Veenker"),"Neighborhood"] <- mode(train$Neighborhood)
train[train$Condition1 %in% c("RRAe","RRAn","RRNe","RRNn"),"Condition1"] <- "nearRailroad"
train[train$RoofStyle %in% c("Flat","Gambrel","Mansard","Shed"),"RoofStyle"] <- mode(train$RoofStyle)
train[train$Foundation %in% c("Stone","Wood"),"Foundation"] <- mode(train$Foundation)
train[train$Functional %in% c("Maj1","Maj2","Sev"),"Functional"] <- "other"
train[train$GarageType %in% c("2Types","CarPort"),"GarageType"] <- "other"
train[train$MiscFeature %in% c("Gar2","Othr","Shed","TenC"),"MiscFeature"] <- "hasMisc"
train[train$SaleType %in% c("Con","ConLD","ConLI","ConLw","CWD","Oth"),"SaleType"] <- "other"
train[train$SaleCondition %in% c("AdjLand","Alloca","Family"),"SaleCondition"] <- "other"
Same for the test.
test[test$MSSubClass %in% c("40","80"),"MSSubClass"] <- mode(test$MSSubClass)
test[test$MSZoning %in% c("C (all)"),"MSZoning"] <- mode(test$MSZoning)
test[test$LotShape %in% c("IR1","IR2","IR3"),"LotShape"] <- "irregular"
test[test$LotConfig %in% c("FR2","FR3"),"LotConfig"] <- "twoPlus"
test[test$Neighborhood %in% c("Blueste","NPkVill","Veenker"),"Neighborhood"] <- mode(test$Neighborhood)
test[test$Condition1 %in% c("RRAe","RRAn","RRNe","RRNn"),"Condition1"] <- "nearRailroad"
test[test$RoofStyle %in% c("Flat","Gambrel","Mansard","Shed"),"RoofStyle"] <- mode(test$RoofStyle)
test[test$Foundation %in% c("Stone","Wood"),"Foundation"] <- mode(test$Foundation)
test[test$Functional %in% c("Maj1","Maj2","Sev"),"Functional"] <- "other"
test[test$GarageType %in% c("2Types","CarPort"),"GarageType"] <- "other"
test[test$MiscFeature %in% c("Gar2","Othr","Shed","TenC"),"MiscFeature"] <- "hasMisc"
test[test$SaleType %in% c("Con","ConLD","ConLI","ConLw","CWD","Oth"),"SaleType"] <- "other"
test[test$SaleCondition %in% c("AdjLand","Alloca","Family"),"SaleCondition"] <- "other"
Month sold should be a string instead of an integer because there is no clear ordering to it.
train$MoSold <- as.character(train$MoSold)
test$MoSold <- as.character(test$MoSold)
Removing Outlier:
Observation 1299 of the train data kept appearing as an outlier so will be removed from the data set.
train <- train[-1299,]
Another random forest model will be run to see if the results beat our baseline model. The random forest model requires little to no pre-processing.
set.seed(12334)
rfMod <- train(SalePrice~.,data=train[,-1],#1=Id
trControl=ctrl,
metric="rmsle",
maximize=FALSE,
tuneGrid=expand.grid(mtry=seq(20,40,by=5)))
saveRDS(rfMod,"rfMod.rds")
The changes made created a slight improvement in our results.
rfMod <- readRDS("rfMod.rds")
min(rfMod$results$rmsle)
## [1] 0.1354918
data.frame(model=c("baseline","rfModel"),rmsle=c(min(rfBase$results$rmsle),min(rfMod$results$rmsle)))
## model rmsle
## 1 baseline 0.1394914
## 2 rfModel 0.1354918
The following models will require us to make changes to the data such as standardization and dummy variables.
Standardization:
makeStandard <- preProcess(train[,c(-1,-69)],method=c("center","scale"))#1=Id,69=SalePrice
standardTrain <- predict(makeStandard,train)
Dummy Variables:
After creating dummy variables and dropping id there are a total of 157 variables, with one of them being the dependent variable sales price.
for(name in names(standardTrain)){
if(is.character(train[,name])){
train[,name] <- as.factor(train[,name])
}
}
makeDummy <- dummyVars(~.,standardTrain,fullRank = TRUE)
trainClean <- predict(makeDummy,standardTrain)
Lasso:
set.seed(342)
lassoMod <- train(SalePrice~.,data=trainClean[,-1],#1=Id
method="lasso",
trControl=ctrl,
tuneGrid=expand.grid(fraction=seq(.05,.8,by=.05)),
metric="rmsle",
maximize=FALSE)
saveRDS(lassoMod,"lassoMod.rds")
lassoMod <- readRDS("lassoMod.rds")
min(lassoMod$results$rmsle)
## [1] 0.1472279
Cubist Mod:
set.seed(1234)
cubistMod <- train(SalePrice~.,data=trainClean[,-1],#1=Id
method="cubist",
trControl=ctrl,
metric="rmsle",
maximize=FALSE,
tuneGrid=expand.grid(committees=seq(70,100,by=5),neighbors=c(7,8,9)))
saveRDS(cubistMod,"cubistMod.rds")
cubistMod <- readRDS("cubistMod.rds")
min(cubistMod$results$rmsle)
## [1] 0.1220886
Round 1 Results:
The cubist model performed the best of the three chosen models in the first round of modeling.
modelResults <- resamples(list(lasso=lassoMod,
cubist=cubistMod,
randomForest=rfMod))
bwplot(modelResults,metric="rmsle")
Near Zero Variance:
When near zero variance predictors were removed before the parameters for what define a near zero variance predictor were very strict. Now the default parameters will be used. This drops an additional fifteen variables.
test <- test[,-nearZeroVar(train)]
train <- train[,-nearZeroVar(train)]
train$OverallCondSQ <- train$OverallQual^2
train$GrLivAreaSQ <- train$GrLivArea^2
train$GarageCarsSQ <- train$GarageCars^2
train$TotalBsmtSFSQ <- train$TotalBsmtSF^2
train$ExterQualSQ <- train$ExterQual^2
test$OverallCondSQ <- test$OverallQual^2
test$GrLivAreaSQ <- test$GrLivArea^2
test$GarageCarsSQ <- test$GarageCars^2
test$TotalBsmtSFSQ <- test$TotalBsmtSF^2
test$ExterQualSQ <- test$ExterQual^2
Standardizing Data:
The numeric predictors will be standardized again.
SalePrice <- train$SalePrice
train <- train %>% select(-SalePrice)
makeStandard <- preProcess(train,method=c("center","scale"))
standardTrain <- predict(makeStandard,train)
standardTest <- predict(makeStandard,test)
The train and test data set will be merged so that the dummy variables are the same across the two sets. The first level of each categorical variable will be set to the mode. This will be important for when we convert our variables into dummy variables the mode of each category will become the baseline. If we were to drop a dummy variable because of its rareness it will be the same as converting our rare variable into the mode level.
combined <- rbind(standardTrain,standardTest)
# making the mode the reference level and changing characters into factors
for(name in names(combined)){
if(is.character(combined[,name])){
combined[,name] <- as.factor(combined[,name])
baseLine <- mode(combined[,name])
combined[,name] <- relevel(combined[,name],ref=as.character(baseLine))
}
}
makeDummy <- dummyVars(~.,data=combined,fullRank = TRUE)# id=1,saleprice=70
combinedClean <- predict(makeDummy,combined)
combinedClean <- as.data.frame(combinedClean)
trainClean <- combinedClean[1:1459,]
trainClean$SalePrice <- SalePrice
testClean <- combinedClean[1460:2918,]
For some of the more rare labels it is possible for them to only exist in the train set or only exist in the test set. If they only exist in the train set it will lead to errors when trying to make predictions on the test set. If they exist only in the test set it will lead to unnecessary variance when building our models.
# names not in the train set
for(name in names(trainClean)){
if(sum(trainClean[,name])==0){
print(name)
}
}
## [1] "MSSubClass.150"
trainClean <- trainClean %>%
select(-MSSubClass.150)
testClean <- testClean %>%
select(-MSSubClass.150)
# names not in the test set
for(name in names(testClean)){
if(sum(testClean[,name])==0){
print(name)
}
}
## [1] "HouseStyle.2.5Fin"
## [1] "Exterior1st.ImStucc"
## [1] "Exterior1st.Stone"
trainClean <- trainClean %>%
select(-HouseStyle.2.5Fin,-Exterior1st.ImStucc,-Exterior1st.Stone)
testClean <- testClean %>%
select(-HouseStyle.2.5Fin,-Exterior1st.ImStucc,-Exterior1st.Stone)
High Correlations:
Correlated values above .9 will be removed from the train and test set.
correlatedColumns <- findCorrelation(cor(trainClean[,c(-1,-149)]),cutoff = .9)
testClean <- testClean[,-correlatedColumns]
trainClean <- trainClean[,-correlatedColumns]
trainClean$SalePrice <- log(trainClean$SalePrice)
ctrl <- trainControl(method="cv",number=10)
Lasso 2:
set.seed(342)
lassoMod2 <- train(SalePrice~.,data=trainClean[,-1],#1=Id
method="lasso",
trControl=ctrl,
#metric="rmsle",
tuneGrid=expand.grid(fraction=seq(.2,.7,by=.05)),
maximize=FALSE)
saveRDS(lassoMod2,"lassoMod2.rds")
lassoMod2 <- readRDS("lassoMod2.rds")
min(lassoMod2$results$RMSE)
## [1] 0.1227373
Cubist Mod 2:
set.seed(1234)
cubistMod2 <- train(SalePrice~.,data=trainClean[,-1],
method="cubist",
trControl=ctrl,
#metric="rmsle",
maximize=FALSE,
tuneGrid=expand.grid(committees=seq(70,100,by=5),neighbors=c(7,8,9)))
saveRDS(cubistMod2,"cubistMod2.rds")
cubistMod2 <- readRDS("cubistMod2.rds")
min(cubistMod2$results$RMSE)
## [1] 0.120205
Results:
Both the lasso and cubist performed slightly better after the second round of feature engineering.
modelResults2 <- resamples(list(lasso=lassoMod2,
cubist=cubistMod2))
bwplot(modelResults,metric="rmsle")
bwplot(modelResults2,metric="RMSE")
The final model submitted was the cubist2 model. It achieved a rmsle of .12077 on the test set.
#exp to transform back to original units
outputDF <- data.frame(ID=test$Id,SalePrice=exp(predict(cubistMod2,testClean)))
write.csv(x = outputDF,file = "submission2.csv",row.names=FALSE)